Overestimation bias





Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning

Neural Information Processing Systems

Training offline RL models using visual inputs poses two significant challenges: the overfitting problem in representation learning and the overestimation bias for expected future rewards. Recent work has attempted to alleviate the overestimation bias by encouraging conservative behaviors. This paper, in contrast, tries to build more flexible constraints for value estimation without impeding the exploration of potential advantages. The key idea is to leverage off-the-shelf RL simulators, which can be easily interacted with in an online manner, as test beds for offline policies. To enable effective online-to-offline knowledge transfer, we introduce CoWorld, a model-based RL approach that mitigates cross-domain discrepancies in state and reward spaces. Experimental results demonstrate the effectiveness of CoWorld, which outperforms existing RL approaches by large margins.
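The overestimation bias mentioned above is not specific to CoWorld; it arises whenever a value target takes a max over noisy Q-estimates. A minimal standalone numpy sketch (an illustration of the phenomenon, not of CoWorld's method) makes it concrete: even when every per-action estimate is unbiased, the max over them is biased upward.

```python
import numpy as np

# Illustration of overestimation bias: 10 actions, all with true value 0,
# each estimated with unbiased zero-mean Gaussian noise. The max over the
# noisy estimates is systematically positive, even though max_a Q(a) = 0.
rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000
estimates = rng.normal(0.0, 1.0, size=(n_trials, n_actions))  # unbiased per action

max_bias = estimates.max(axis=1).mean()
print(f"bias of max over {n_actions} actions: {max_bias:.3f}")
# ≈ 1.54 (E[max of 10 iid N(0,1) draws] ≈ 1.539)
```

This is the quantity that conservative offline RL methods suppress, typically at the cost of exploration; the abstract above argues for constraining it more flexibly instead.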


SPQR: Controlling Q-ensemble Independence with Spiked Random Model for Reinforcement Learning

Neural Information Processing Systems

Alleviating overestimation bias is a critical challenge for deep reinforcement learning to achieve successful performance on more complex tasks or offline datasets containing out-of-distribution data. In order to overcome overestimation bias, ensemble methods for Q-learning have been investigated to exploit the diversity of multiple Q-functions. Since network initialization has been the predominant approach to promote diversity in Q-functions, heuristically designed diversity injection methods have been studied in the literature. However, previous studies have not attempted to approach guaranteed independence over an ensemble from a theoretical perspective. By introducing a novel regularization loss for Q-ensemble independence based on random matrix theory, we propose spiked Wishart Q-ensemble independence regularization (SPQR) for reinforcement learning. Specifically, we modify the intractable hypothesis testing criterion for the Q-ensemble independence into a tractable KL divergence between the spectral distribution of the Q-ensemble and the target Wigner's semicircle distribution. We implement SPQR in several online and offline ensemble Q-learning algorithms. In the experiments, SPQR outperforms the baseline algorithms in both online and offline RL benchmarks.
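The regularization target named in the abstract, Wigner's semicircle distribution, is the limiting eigenvalue density of a symmetric random matrix with independent entries. A short numpy sketch (an illustration of the semicircle law itself, not the SPQR loss, whose exact matrix construction is in the paper) shows what "independence" looks like spectrally: the eigenvalue histogram of a normalized Wigner matrix approaches p(x) = sqrt(4 - x^2) / (2*pi) on [-2, 2], which is the distribution SPQR pulls the Q-ensemble's spectrum toward via a KL term.

```python
import numpy as np

# Build a Wigner matrix: symmetric, i.i.d. Gaussian entries, scaled so the
# eigenvalue spectrum converges to the semicircle density on [-2, 2].
rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=(n, n))
w = (a + a.T) / np.sqrt(2 * n)          # entries have variance 1/n
eigs = np.linalg.eigvalsh(w)

# Compare the empirical spectral density against the semicircle density.
hist, edges = np.histogram(eigs, bins=16, range=(-2, 2), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
semicircle = np.sqrt(np.clip(4 - centers**2, 0.0, None)) / (2 * np.pi)
print(f"max density gap vs semicircle: {np.abs(hist - semicircle).max():.3f}")
```

Under dependence between the ensemble members (e.g. Q-functions collapsing to similar solutions), the spectrum instead develops outlier "spikes" outside [-2, 2], which is the deviation the spiked-model hypothesis test detects and the KL regularizer penalizes.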






To Reviewer # 3: Thank you for your careful reading and thoughtful reviews

Neural Information Processing Systems

Q1: Theorems 3 and 4. (i) Theorem 3 shows that SD2 helps to reduce the overestimation bias compared … We empirically show that SD2 does not underestimate and can reduce the absolute bias in Figure 4. The left-hand side in Eq. (19) equals …

Q5: How is the performance of the proposed approximation method? We will try to further investigate it in future research.

Q2: Related works about ensemble methods.